Improved high quality sand fly assemblies enabled by ultra low input long read sequencing

Phlebotomine sand flies are the vectors of leishmaniasis, a neglected tropical disease. High-quality reference genomes are an important tool for understanding the biology and eco-evolutionary dynamics underpinning disease epidemiology. Previous leishmaniasis vector reference sequences were limited by sequencing technologies available at the time and inadequate for high-resolution genomic inquiry. Here, we present updated reference assemblies of two sand flies, Phlebotomus papatasi and Lutzomyia longipalpis. These chromosome-level assemblies were generated using an ultra-low input library protocol, PacBio HiFi long reads, and Hi-C technology. The new P. papatasi reference has a final assembly span of 351.6 Mb and contig and scaffold N50s of 926 kb and 111.8 Mb, respectively. The new Lu. longipalpis reference has a final assembly span of 147.8 Mb and contig and scaffold N50s of 1.09 Mb and 40.6 Mb, respectively. Benchmarking Universal Single-Copy Orthologue (BUSCO) assessments indicated 94.5% and 95.6% complete single copy insecta orthologs for P. papatasi and Lu. longipalpis. These improved assemblies will serve as an invaluable resource for future genomic work on phlebotomine sandflies.


Background & Summary
Phlebotomine sand flies (family Psychodidae, order Diptera) include several genera of hematophagous arthropods that vector important emerging and re-emerging infectious diseases.They transmit bacterial, viral, and, most notably, the protozoan pathogen Leishmania, to humans and animals.Leishmaniasis is a group of diseases that range in clinical manifestation, from self-healing cutaneous lesions to disfiguring mucocutaneous ulcers to fatal visceral disease.Clinical tropisms can be highly dependent on infective species and vectoring sand fly.Over 90 species of sand flies found across Latin America, Africa, the eastern Mediterranean, Southeast Asia, and Europe have been implicated as vectors for approximately 20 species of Leishmania parasites that cause leishmaniasis 1,2 .
Phlebotomus papatasi vectors Leishmania major, an etiological agent of cutaneous leishmaniasis, across North Africa, the Middle East, and the Indian subcontinent 3 .It is a restrictive vector in that it can only transmit a single Leishmania species, Le. major.However, P. papatasi also transmits viral febrile illnesses across its distribution 4,5 .Lutzomyia longipalpis is the major vector responsible for transmission of the visceral leishmaniasis causing parasite, Leishmania infantum, in the Americas 6 .Lu. longipalpis is a permissive vector in the laboratory, transmitting several Leishmania species, however in nature it only transmits Le. infantum 7 .Lu. longipalpis has a wide geographic distribution inhabiting a range of diverse ecological habitats and has garnered interest as a species complex.Others have observed differences in spot numbers, pheromones, mating songs, and noted reproductive isolation between different populations collected throughout Brazil 8 .Leishmaniasis pathogenesis is thought to be dependent on complex host, vector, and parasite interactions and, although the epidemiological implications of a Lu.longipalpis species complex remain unclear, understanding the molecular underpinnings that that lead to vector competence, reproductive isolation and adaptation is critical from an epidemiological and disease control perspective.
In mosquito research, high-quality reference genomes have enabled inquiries into population genetics and metagenomics, identification of gene markers of senescence, vector competence, insecticide resistance, and experimental gene drive approaches to vector control.These have ultimately improved understanding and management of the vector in the disease transmission cycle 9 .Unfortunately, the fragmented nature of current sand fly references slowed similar inquiries for Leishmania transmission.
Previous reference genomes for P. papatasi and Lu.longipalpis 10 suffered very low contiguity.Using the best sequencing technology at the time, read lengths were limited to ~400 bp -too short to span many repeats.More damaging to assembly contiguity, previous library protocol DNA input minimums required DNA to be pooled from many individuals, inserting many different haplotypes into the assembly algorithm.Genome heterozygosity could not be controlled for by inbreeding in sand flies, and haplotype sequence variation -for example, a short insertion polymorphism -caused assembly tools designed for a single haplotype to create sequence gaps in areas of uncertainty.Together, these constraints led the genome assemblies for P. papatasi and Lu.longipalpis to be the 2 nd and 3 rd worst available in VectorBase 11 , with contig N50 lengths at 5,795 bp and 7,481 bp, respectively.For reference, across all genomes in VectorBase at the time, the median assembly contig N50 was 51,691 bp.Additionally, no Hi-C or chromosome scale data was available, and these fragmented genome assemblies were inadequate for many genome analyses.
Here, we update these two important sand fly vector genome references leveraging a decade's worth of technological advances.Specifically, very high quality long read sequences of Q20 or even Q30 are available in lengths longer than the previous assemblies contigs.Second, Hi-C technologies have become de rigueur and have higher chromosomal completion rates when paired with the significantly longer contigs generated by high quality long read assembly.Finally, an ultra-low input library protocol developed by Pacific Biosciences 12 enabled the sequencing of a single individual sand fly.This greatly simplified assembly of sequence information from only 2 haplotypes derived from a single individual rather than many haplotypes from a pool of individuals.A small compromise, as only 30 ng of genomic DNA can be isolated from a single sand fly male, is the use of whole genome amplification.Together these three techniques have generated the greatly improved reference assemblies we describe here.

Genome Sequence Report
The genomes of P. papatasi and Lu.Longipalpis were each sequenced from a single male from colonies maintained at the University of Notre Dame.The P. papatasi colony was established in the 1970s from the Israeli strain and the Lu.Longipalpis colony was established in 1988 from the Jacobina strain caught from Bahia State, Brazil.P. papatasi sequencing generated 102x coverage and Lu.longipalpis sequencing generated 53x coverage of PacBio HiFi long reads.Additional material from other individuals from the same colonies was used for Hi-C library preparation.
The final P. papatasi assembly has a span of 351.6 Mb, 646 scaffolds, and a scaffold N50 of 111.8 Mb.The final Lu.Longipalpis assembly has a span of 147.8 Mb, 4 scaffolds, and a scaffold N50 of 40.6 Mb (Table 1, Figs. 1 & 2).The updated assemblies improved upon several deficiencies from the previous assemblies (Table 2).Compared to the previous assemblies, contiguity has improved over 100-fold and these larger contigs are placed in a chromosomal context.
Two genome annotations are available for each species.The first is a new NCBI RefSeq 13 annotation based on not just this assembly but also new long read transcript data generated to support new annotation.Gene numbers derived from this annotation are shown in Table 2 and BUSCO analysis in Table 3.The number of complete single copy insecta single copy orthologs increased by ~10%.That is, an additional 10% of genes that were previously incomplete or missing are now easily accessible in the improved assembly.In addition to this updated annotation resource, we wished to preserve previous annotations, especially user contributed curated annotations, which connect the genome to previously published analyses.To preserve previous annotation information, we utilized the new open-source pipeline Transfer-annotations 14 developed by VectorBase engineers to iteratively run Liftoff 15 to accurately transfer previous annotations to new VectorBase Apollo browser tracks and generate a downloadable GFF3 annotation file for each species.

Sample acquisition and nucleic acid extraction.
Single males were chosen for sequencing to capture the heterogametic sex chromosomes, and to ensure only high quality long read sequence data from a single diploid genome was presented to the assembly software for facile assembly.A single male adult sand fly was aspirated from each of our P. papatasi and Lu.Longipalpis colonies and frozen at −80 °C.Each specimen was chilled in liquid nitrogen and ground into a fine powder preceding DNA extraction using a modified Puregene ® kit extraction protocol (Qiagen, Hilden, Germany).DNA was eluted in 30 μl of TE buffer and concentration was assessed using a Nanodrop Spectrophotometer.
Long read library construction and sequencing.Pacific Biosciences HiFi Libraries were constructed using an ultra-low input library protocol 12 .The P. papatasi library was prepared at Pacific Biosciences using a pre-production version of the library kit.The Lu. Longipalpis library was prepared at the UC Davis DNA technologies core using the commercially available SMRTbell gDNA Sample Amplification Kit (Pacific Biosciences, Menlo Park, CA; Cat.#101-980-000) and the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences; Cat.#100-938-900) according to the manufacturer's instructions.Briefly, approximately 10 kb sheared DNA by the Megaruptor 3 system (Diagenode, Belgium; Cat.#B06010003) was used for removal of single-strand overhangs at 37 °C for 15 minutes, DNA damage repair at 37 °C for 30 minutes, end repair and A-tailing at 20 °C for 30 minutes and 65 °C for 30 minutes, and ligation of overhang adapters at 20 °C for 60 minutes.To prepare for library amplification by PCR, the library was purified with ProNex beads (Promega, Madison, WI; Cat.# NG2002) for two PCR amplification conditions at 15 cycles each then another ProNex beads purification.Purified amplified DNA from both reactions were pooled in equal mass quantities for another round of enzymatic steps that included DNA repair, end repair/A-tailing, overhang adapter ligation, and purification with ProNex Beads.The PippinHT system (Sage Science, Beverly, MA; Cat # HPE7510) was used for SMRTbell library size selection to remove fragments <6-10 kb.The 10-11 kb average HiFi SMRTbell library was sequenced using one 8 M SMRT cell, Sequel IIe sequencing chemistry 2.0, and 30-hour movies each on a PacBio Sequel II sequencer.(b) Phlebotomus papatasi assembly JANPWV01.Both plots were generated using blobtoolkit 43 .
Long read assembly.The draft Lu. longipalpis genome assembly was assembled using hifiasm 16 from HiFi data generated from a single male individual at the UC Davis Genome Core using the ultra-low input protocol.
Filtering input reads to have an average quality >Q30 was found to give a more contiguous final assembly for this dataset than Q20 filtered reads and was used for the final assembly.The draft genome assembly for P. papatasi was generated at Pacific Biosciences based on HiFi reads generated at Pacific Biosciences with a library made from (b) Phlebotomus papatasi assembly JANPWV01 with no-hits filtered out.Both plots were generated using blobtoolkit 43 .a single adult male individual using an ultra-low input library kit.The long-read assembly was performed using HGAP and Falcon 17 .

Old
3D sequencing and assembly.The high-quality drafts were upgraded to chromosome-length using Hi-C data derived from different male individuals from the same respective colonies at the University of Notre Dame.
The in situ Hi-C libraries were generated as described in Rao, Huntley et al. 18 .Briefly, whole insect bodies were crosslinked with 1% formaldehyde for 10 minutes at room temperature.Nuclei were extracted via grinding and permeabilized using SDS.DNA was digested with a cocktail of Csp6I and MseI, and the ends of restriction fragments were labeled using biotinylated nucleotides then ligated.After reversal of crosslinks, ligated DNA was purified and sheared to a length of ~400 bp, at which point ligation junctions were pulled down with streptavidin beads and prepped for Illumina sequencing.The resulting libraries were sequenced using Illumina NovaSeq 6000 instruments.Hi-C data were aligned to the draft references using Juicer 19 , and 3D assembly for both species was performed using 3D-DNA pipeline 20 .In view of the large number of alternative haplotypes incorporated in the draft assembly as separate sequences 21 , 3D-DNA pipeline was run with the "merge" step option for Lu.longipalpis (see Matthews et al. 22 ) to remove alt haplotypes from the anchored portion of the assembly.The resulting assemblies were reviewed and curated using Juicebox Assembly Tools 23 .The resulting contact maps (Fig. 3) can be explored interactively at multiple resolutions via Juicebox.js 24at the DNA Zoo website pages 25,26 .
Removal of non-chromosomal sequences from Lu. longipalpis.During BUSCO analysis the Lu.longipalpis draft assembly contained high numbers of duplicate BUSCO genes.This was due to the presence of alternative haplotype sequences in the unanchored portion of the assemblies.As expected, removing unanchored sequences during annotation greatly reduced the duplicates.
Gene annotation lift-over.We used the pipeline Transfer-Annotations 14 and the program Liftoff 15 to move previous gene annotations and manual curations to the new reference assembly.Liftoff distance and flank parameters were determined by incrementally changing them to find the combination with the lowest flank number and the fewest missing features.We used agat_sp_fix_cds_phases 27 to calculate phase information and identify any transferred gene models that are incomplete or altered.AGAT's agat_sp_extract_sequences 27 was used to extract CDS protein sequences for the transferred genes on the new genome.The Transfer-annotations pipeline then identifies missing CDS regions, and it produces a corrected GFF3 with metadata regarding model validity in the GFF3 attributes column.This process includes if a protein sequence contains stop codons, if it matches the original sequence, or if it has any missing CDS regions.Transfers were considered invalid if the coding sequence had a missing CDS region or internal stop codon, or ncRNA sequences did not match between the source and transfer sequences.Coding sequences with mismatched protein sequences were not considered invalid and are flagged for future examination.
A final GFF3 of the transferred annotation is available at VectorBase as an Apollo genome browser track color coded by estimated transfer quality.A majority of genes transferred from each original source genome to its replacement assembly (Table 4).However, 30.3% and 22.0% were invalidated by missing CDS regions and internal stop codons, and 73.2% and 62.8% of CDS had mismatched protein sequences.That not all annotations could be transferred is likely unavoidable due to the differences in genome quality.

Data Records
Lutzomyia longipalpis PacBio HiFi 28 long reads and final assembly 29 are available at the NCBI with BioProject accession number PRJNA849274 30 .Lutzomyia longipalpis HiC short reads are available at the NCBI SRA 31 with BioProject accession number PRJNA512907 32 .Phlebotomus papatasi PacBio HiFi long reads 33 and final assembly 34 are available at the NCBI with BioProject accession numbers PRJNA657245 35 and PRJNA858452 36 respectively.Phlebotomus papatasi HiC short reads are available at the NCBI SRA 37 with BioProject accession number PRJNA512907 32 .Additional sub-accessions are shown in Table 1.

technical Validation
One of our aims was for these new genome references to meet the Earth BioGenome Project standards 38 despite the small amounts of input materials.Specifically, we aimed to have >1 Mb contig N50, and achieved full chromosome lengths using Hi-C data.
We assessed reference gene model completeness using BUSCO 39 (V3.0.2).For both sandfly references the diptera_odb10 set of 2,910 single copy orthologs are missing 225 (6.8%) of the genes (Table 3).This number decreases when the analysis is performed on larger taxonomic groups with smaller BUSCO gene sets.For example, only ~3% of genes (P. papatasi (42) and Lu.longipalpis (29)) are missing from the 1,367 insecta_odb10 BUSCO gene set.Whilst this is a vast improvement on the previous assemblies, future work is required to determine which missing genes are due to assembly problems such as gaps between 1 Mb N50 contigs or genuine gene loss during >150 million years of divergence time between these species and others in the orthoDB database at the current time 40,41 .
While assessing base coverage and GC content for P. papatasi, we noticed a blob that stood out from the rest of the Arthropoda hits, with several-fold less base coverage (accession #: CM045756.1).Hits for this "blob" included families Culicidae, Curculionidae, formicidae, Kalotermitidae, Noctuidae, and Drosophilidae.To assess for contamination, we blasted these regions against the NCBI nucleotide database.The top hits returned P. papatasi.To investigate the possibility of a sex chromosome, we blasted Y chromosome-linked scaffolds in Lu. longipalpis identified by Vigoder et al. against the NCBI nucleotide database 42 .While there were several P. papatasi hits, none were localized to this blob.Interestingly, other hits included the X chromosome for several different species of flies, three of which have an XY sex chromosome system.Finally, we blasted our blob of interest against the Drosophila Y chromosome (NC_024512.1).There was no significant similarity found.

Table 1 .
Genome data and global statistics.

Table 2 .
Comparison of old and new assembly statistics.

Table 3 .
BUSCO results for two new sandfly references.